Estimating Uncertainty Intervals from Collaborating Networks

Effective decision making requires understanding the uncertainty inherent in a prediction. In regression, this uncertainty can be estimated by a variety of methods; however, many of these methods are laborious to tune, generate overconfident uncertainty intervals, or lack sharpness (give imprecise intervals). We address these challenges by proposing a novel method to capture predictive distributions in regression by defining two neural networks with two distinct loss functions. Specifically, one network approximates the cumulative distribution function, and the second network approximates its inverse. We refer to this method as Collaborating Networks (CN). Theoretical analysis demonstrates that a fixed point of the optimization is at the idealized solution, and that the method is asymptotically consistent to the ground truth distribution. Empirically, learning is straightforward and robust. We benchmark CN against several common approaches on two synthetic and six real-world datasets, including forecasting A1c values in diabetic patients from electronic health records, where uncertainty is critical. In the synthetic data, the proposed approach essentially matches ground truth. In the real-world datasets, CN improves results on many performance metrics, including log-likelihood estimates, mean absolute errors, coverage estimates, and prediction interval widths.


Introduction
Deep learning techniques have provided breakthroughs in a multitude of prediction problems; however, effective decision-making often requires accurate assessment of uncertainty in addition to prediction (Bellman and Zadeh, 1970). In a single continuous outcome, the conditional probability distribution can be used more effectively in decisionmaking. For example, consider forecasting future lab values of diabetic patients from electronic health record data. This forecast should be used only if it is high confidence, which depends on how recent and complete the data is. Therefore, we want to build a system that faithfully quantifies its uncertainty based on the available information.
Quantifying uncertainty has a long history in statistics and has recently been extended into neural network frameworks (Blundell et al., 2015;Gal, 2016;Blei et al., 2017). The outputs of these systems should ideally be statistically calibrated (Mitchell and Wallis, 2011), meaning that the nominal level of uncertainty should reflect the true occurrence rate of an event. Calibration has been extensively researched on binary classification problems, where out-of-the-box deep learning methods typically result in over-confident uncertainty quantification. In these cases, correction methods such as Platt scaling are necessary to adjust the predictions (Platt, 1999;Guo et al., 2017).
Poor calibration also hinders effective decision making in regression problems (continuous outcomes). Although various methods can estimate uncertainty for continuous outcomes such as heteroskedastic regression (Harvey, 1976), Bayesian approximate methods (Gal, 2016;Gal et al., 2017), ensemble methods (Lakshminarayanan et al., 2017), and quantile regression based methods (Tagasovska and Lopez-Paz, 2019;Koenker and Hallock, 2001), they can fall short due to model misspecification, training instability, or lack of generalizability (Kuleshov et al., 2018). Adjusting for poor calibration is a challenging task in continuous outcome spaces, as the number of events to calibrate is virtually indefinite. Additionally, customization is often required for different types of intervals (e.g., one-sided or two-sided, continuous or noncontinuous) or the proposed nominal levels of coverage (e.g., 70 %, 90 %, 95 %) used in a given application. In addition to well-calibrated predictions, it is necessary to have sharp, precise intervals. Given the same level of coverage, a sharper interval is preferred and is more informative (Pearce et al., 2018). Empirically, simultaneously ensuring sharpness and calibration is difficult as methods typically achieve sharpness by sacrificing calibration or the other way around.
In this manuscript, we introduce a new modeling framework capable of learning two networks to provides both faithful coverage intervals and sharp predictions. One of our networks attempts to learn the conditional Cumulative Distribution Function (CDF) given a collection of data observations. To help learn this network, we pair it with a second network that approximates the conditional inverse CDF. Despite the seeming similarity to an autoencoder with these paired functions, the networks must be learned with separate losses in a similar fashion as Generative Adversarial Network (GANs) (Goodfellow et al., 2014). Because these networks interact and must be consistent with one another, we refer to this method as the Collaborating Networks (CN). We show that the desired solution (the two networks give the true conditional CDF and inverse CDF) is a fixed point of the optimization scheme and that the approach yields a stable solution with asymptotic consistency under certain model classes. In the following sections we provide theoretical analyses and demonstrate empirical performance on two synthetic dataset and six real-world datasets. The code to reproduce the experiments is publicly available 1 .

Related Work
Uncertainty in binary classification is well-explored. In the classical logistic regression setting, the probability is usually well calibrated (Kull et al., 2017). However, deep networks become overconfident due to overfitting, which can be partially addressed by the usage of normalization and weight decay (Guo et al., 2017) or by intelligently varying the inputs (Thulasidasan et al., 2019). Platt scaling has been fairly effective (Platt, 1999). In this two-step method, the initial prediction is learned as p i on the training data, and then a small reserved dataset is used to fit q i = σ(a + bp i ) as the calibrated probability. Another frequently used nonparametric calibration method is called isotonic regression (Zadrozny and Elkan, 2002), where the interval from 0 to 1 is binned from the pre-trained network, and the observed proportions on the data replaces the original predicted probability.
The challenge of studying continuous problems is that it often requires the modeling of the full span of the conditional distribution. Classically, the tilted loss function in quantile regression provides a nonparametric framework to predict conditional quantile information for any given percentile q ∈ (0, 1) (Koenker and Bassett, 1978), which has been extended into neural network frameworks to make it simultaneously mimic the full distribution over the full support (Tagasovska and Lopez-Paz, 2019). Although quantile regression asymptotically learns the true conditional quantile information, it could be subject to underfitting or overfitting, failing to calibrate empirical coverage in practice (Rodriguez and Yao, 2017). Modern uses of quantile regression could also be combined with conformal prediction to obtain finite sample calibration (Romano et al., 2019).
Bayesian Neural Networks naturally form a scheme to generate uncertainty estimates (Neal, 2012). One can draw a posterior predictive distribution based on the data we observed using sampling methods (Neal, 2012). To address computational challenges, one can approximate the Bayesian solution by introducing dropout training as an approximation to Bayesian inference (Gal and Ghahramani, 2016;Gal, 2016), by using variational inference (Blei et al., 2017;Blundell et al., 2015;Wainwright and Jordan, 2008;Graves, 2011), or by using stochastic gradient sampling methods (Li et al., 2016). Approximate Bayesian approaches are sensitive to model misspecifications that can result in mismatch between the claimed credibility and reality (Pernot and Cailliez, 2017). Essentially, model misspecifications arise when there is discrepancy between the specified and the actual data generating process (Frazier et al., 2019). Adopting overly complex or simple model architectures, erroneous priors or parameter specifications, or an unsuitable choice of uncertainty model are different specific cases of model misspecifications.
Regression schemes can be modified to generate heteroscedastic uncertainty by training networks to produce both a mean and variance estimate under a Gaussian likelihood (Nix and Weigend, 1994). This formulation can suffer from instability and is prone to overfitting, but can be enhanced by adjusting the gradient calculation and training the mean and variance model separately (Skafte et al., 2019). One could add mild perturbations to the covariate space and combine several independently trained uncertainty models either with regression or other frameworks for a more empirically stable and calibrated uncertainty estimate (Lakshminarayanan et al., 2017). Mixing or combining more underlying models provides more flexible approximation to different forms of distribution functions (Brando et al., 2019).
Finally, uncalibrated models for continuous outcomes could be calibrated in a post-hoc fashion. For example, calibrated regression is a method that fits an extra isotonic regression after an initial model has been trained (Kuleshov et al., 2018), which requires an extra validation set to form a recalibration mapping between nominal level of coverage and the empirical level of coverage. One could also add calibration as a regularization term to the loss of original uncertainty model to enforce the model to yield calibrated predictions with a certain extent of introduced inductive bias (Utpala and Rai, 2020).
Compared to these approaches, our method is unique because it provides an approximation to any family of distribution functions with Lipschitz continuity. Under this framework, it ensures calibration and is not subject to model misspecification in any Lipschitz continuous distribution. The losses give straightforward gradient calculations. It is theoretically sound with a large sample property. We prove consistency to ground truth for broad model classes. We have adopted several techniques to stabilize the model training, and we have practical evidence of it being robust to extreme cases such as overfitting. Empirically, we can incorporate all training covariates to learn the conditional distribution function with our model that effectively generates the prediction interval for any possible quantile, which is shown to be precise and faithful in our empirical evaluations.

Preliminaries
Before introducing our learning methodology, we first set up the notation and network definitions.

Notation
Let X denote the input features, X ∈ X ⊂ ℝ p , with x denoting an observed sample. We denote Y ∈ ℝ as the continuous outcome variable with an observed values y. It is assumed to have a joint distribution function p(Y,X) and an underlying conditional distribution function, p(Y |X = x). Let y (q,X=x) denote the q′th conditional quantile, where ℙ Y < y (q, X = x) | X = x = q. For instance, q = 0.5 would yield the median of the conditional distribution. p(q) is a chosen distribution to generate the percentiles in (0, 1).

Neural Network Approximation Functions
The proposed method is based on jointly learning two functions. We will first define the functions along with their predictive goals. We define a function f θ (q,X) with parameters θ, which will be denoted as f for simplicity. The goal of this network is to approximate the inverse conditional CDF or quantile function of Y |X. An optimal function f should have the property that f(q,X)=y (q, X) , ∀q ∈ (0, 1), X ∈ X. We then define a second function as where z is located in the full outcome space,ℙ(Y < z | X). A perfect g should have the property g y (q, X) , X = q, ∀q ∈ (0, 1), X ∈ X. When both f and g perfectly match their goals, they satisfy the following properties that ∀ q ∈ (0, 1), g(f(q, X), X)=q .
(1) This property states that composing a well-learned CDF and inverse CDF function should produce an identity function. This identity is essential in connecting the two functions and their distribution properties, which will be exploited in the following to create a learning scheme.

Joint Function Learning
A good conditional quantile function f should have the following property to generate well-calibrated quantile estimates, (2) At first glance, a straightforward approach to achieving the property in Eq. (2) would be to directly adopt it as an objective (e.g., minimize the square loss E p(Y | X) [Y < f(q, X)] − q 2 ); unfortunately, this objective function's gradient comes from an indicator function that is ineffective for learning the network. We bypass this learning difficulty with our joint learning scheme, but still ensure the property in Eq. (2) is properly satisfied when our framework is optimized.
Specifically, the neural networks f θ and g γ are bestowed with two distinct losses, g-loss γ : E q p(q), x, y p(X, Y ) ℓ 1 y < f θ (q, x) , g γ f θ (q, x), x , (3) f-loss θ : E q p(q), x p(X) q − g γ f θ (q, x), x 2 . (4) The loss ℓ is a binary cross-entropy loss (or logistic loss), ℓ(b, a) = −b log a − (1 − b) log(1 − a). Eq. (3) and (4) are the losses in expectation; in practice, we would use empirical risk minimization. The distribution for quantiles p(q) can be chosen as desired. Any distribution that fully covers the (0, 1) percentile space satisfies our theoretical framework; in practice, we choose Unif(0, 1) (uniform distribution). A visualization of this proposed model framework is given in Figure 1. Under conditions similar to the theoretical claims in GANs (Goodfellow et al., 2014), these losses induce a fixed point for f and g with their desired properties (see Section 4.1).
The design of this two-loss framework can be understood as follows. When g is updated to minimize the the g-loss, f functions as a space searching tool to help g acquire information about the distribution function over the full relevant space. We demonstrate in our theoretical analysis that f needs only to satisfy mild conditions for g to be able to learn the optimal function. Essentially, if f varies constantly as a function of q and covers the full probability space, then g will learn the CDF by matching the relative count of these events to their estimated probabilities. We show in experiments that even when using a fixed f with a prespecified distribution, g is still able to yield good results. We are still motivated to learn f to make searching the space as efficient as possible, as a good representation of the inverse CDF function will better discriminate low density versus high density regions in outcome spaces and can improve performance with finite samples.
We demonstrate this effect empirically and note that it matches concepts in noise contrastive learning, where efficiency is dependent on how well the generated samples match the true distribution (Gutmann and Hyvärinen, 2010). Thus, we update f to minimize Eq. (4) to learn the distribution information directly from g. Hence, g is the main function that we use to learn the distribution information from data and f is regarded as an auxiliary function that better assists g in space searching when it gets improved during the training. Thus, although both functions target on learning the conditional distribution of Y |X, g is preferred in inferring the outcome uncertainties over the space explorer f after the training process is completed.
The g-loss and f-loss defined in Eq.
(3) and Eq. (4) are straightforward to optimize, and they are convex in function forms, which allows an alternating learning scheme with standard gradient methods. Note that most neural network architectures could be easily incorporated in this framework. We describe the full learning strategy in Section 4.2 and provide pseudocode in Algorithm 1.
Additionally, if desired, our model could be integrated into advanced neural network architectures, such as an LSTM, to enable it to produce uncertainties. This is demonstrated with a time-series dataset in Section 5.4.

Theoretical Results
The functions g and f are designed to learn the conditional CDF and conditional inverse CDF of Y |X. Here, we explore when the loss functions in (3) and (4) will lead to these goals. We limit our discussion to the family of distributions with Lipschitz continuity, which can be well approximated by the deep neural networks according to the universal approximation theorem (Lu and Lu, 2020). Suppose that f and g have enough capacity to represent the ground truth distribution functions within the Lipschitz continuous family. We can then show that the optimal solution is a fixed point of the training scheme. Below is a sketch of proof of these claims, with more detailed proofs available in Appendix A. We begin our analysis with g.
Succinctly, by fixing any {q, x} and letting f(q, x) = z, then the inner part becomes ℙ(Y < z | x)log(g(z, x)) + ℙ(Y ≥ z | x)log(1 − g(z, x)). For any function f(b) = −{a log b + (1 − a)log(1 − b)}, its unique minimum is attained when b = a. Therefore, g(z, x) is optimal The result is also robust to the distribution p(q) over percentiles as long as it has support over all of (0, 1). Our default choice is the uniform distribution Unif(0, 1). The Beta distribution such as β(0.5, 0.5) could as well be utilized if we want to emphasize the distribution on the tails. Proposition 1 reveals an interesting result: g has a fixed point at the optimal solution even when f is not optimal. In the meantime, this raises an open ended question on how 'sub-optimal' an f can be to ensure such property. In practice, each conditional distribution Y |X could vary on (−∞,∞), and having f properly covering all areas in (−∞,∞) is not realistic. Instead, we could narrow our attention to conditional distributions within certain percentile range, such as q ∈ (0.001, 0.999). In this way each Y |X is bounded, and we can always come up with a reasonable f to search the space. For example, a fixed uniform distribution f ∼ Unif(K 1 ,K 2 ) where K 1 is small enough and K 2 is large enough to cover the outcome space it will satisfy the assumption.
The optimality of g leads to an additional question, which is whether our estimate will actually achieve our optimal result. To do that, we sketch out a consistency proof of g that is independent of f. This assumption is critically dependent on the existing M-Estimation theory ( Van der Vaart, 2000). Prior to the statement of the theorem, we need to introduce additional notation. We define any learned g to be a function δ that comes from function space Δ. We make this switch in notation because the theorem is proved in the function space rather than the parameter space of g. We make the assumption that the ground truth CDF function g 0 is included in Δ as δ 0 . Note that using this function space is important for the theory because two different parameter settings in g can map to the same function. Let d be a distance measurement (e.g., absolute difference in L 1 or squared difference in L 2 ).
Next, note that the g-loss of a single sample g-loss i is: Let the M-estimator M n be the n-sample average of the objective evaluated at function g: M n (g)= − ∑ i n g−loss i /n and M(g) = −E(g-loss i ) (true expectation). With that, we can now state the theorem: Theorem 2-For ϵ > 0, If the following three conditions are satisfied, then d δ 0 , δ n P 0, as n → ∞.

3.
The sequence of estimators δ n satisfy M n δ n ≥ M n δ 0 − o p (1) To show that our optimal finite sample estimator is consistent (g n g 0 , the ground truth conditional CDF (δ 0 in theorem)), we need to satisfy these three conditions. A detailed derivation can be found in Appendix A, but we will give some intuition on the conditions. Note that in our derivations we limit the function class to those that satisfy Lipschitz continuity in order to satisfy these conditions. Lipschitz continuity can be imposed in neural networks (Arjovsky et al., 2017) and is a realistic assumption in many uncertainty quantification problems because of the smoothness over q. The main idea of the consistency proof is to link the function proximity δ → δ 0 through their proximity in the evaluated objective M(δ) → M(δ 0 ). The first condition is a form of uniform convergence in probability, and describes that the finite sample objective function should well-approximate the objective function in expectation as the number of samples increases regardless of the chosen δ. The second condition states that the ground truth δ 0 is the only setting that maximizes the objective function in expectation. The third condition states that we should have a sequence of functions δ n (estimator) estimated at each finite sample of size n, that approximately equals δ 0 in the the evaluation of finite sample objective M n . Their difference in finite sample objective M n is commensurate with a small quantity o p (1) which converges in probability to zero as n → ∞.
Then by the large sample property of condition (1), the limit of the sequence will dominate the objective function in expectation. Thus, the δ n should be a consistent estimator for δ 0 , the only maximum. Otherwise, it cannot dominate the objective function, which leads to a contradiction.
Succinctly, for our smooth model class, our estimator g is consistent.
While we assume that the optimization method will minimize the finite-sample objectives, recent theoretical advances in deep learning suggest that it may be a reasonable assumption in practice (Du et al., 2018), which covers related model setups.
Next we explore the fixed point properties on f: Proposition 3-When the g-function is ideal, then the f-function is optimal under Eq. (4). The optimum is attained when f captures the inverse CDF, i.e. f(q, x) → d Y |X given q ∼ Unif(0, 1).

Proof [Sketch of Proposition 3]:
For an ideal g-function, Including this in our f-loss gives If we make q = ℙ(Y < f(q, x) | x), then f-loss= 0 and the loss is optimal. Let the distribution of Y |x be represented as F x . Then we have By combining Proposition 1 and 3, it is clear that our ideal functions are a fixed point when we have access to the complete data distribution and our learned functions have sufficient complexity. We observe from this proof that the fixed point of the f-loss in Eq. (4) relies on g's optimal solution to be achieved first. However, since f does not need to be optimal for g to learn effectively, we are satisfied with getting f close to the true distribution to more efficiently search the space.
As a final note, since our theory is more robust on g than it is for f, we expect that using g to capture uncertainty will perform better, which is also verified empirically in Section 5. Using f, though, is still a competitive solution, demonstrating that both networks are effectively learned with these losses.

Learning Initialization and Stabilization
As demonstrated in Section 4.1, the learning of the g-function has a fixed point at the optimal solution as long as the f-function possesses some mild properties. Regardless, we prefer the joint learning scheme over g and f because a better f enables more efficient learning on g. Second, we note that f can collapse if g becomes "too good" (100 % prediction confidence), as the loss of f is embedded in g, and f learns the inverse CDF by inverting g. Therefore, it is important that g is initialized properly and stays stable to prevent the f function from experiencing mode collapse (Creswell et al., 2018;Salimans et al., 2016), deteriorating the efficiency of space exploration.
For the initialization, we start by training g independently of f, also known as the pretraining step. Instead of using f θ (q i , x i ) to randomly generate samples from the conditional distribution, we adopt a generator p(Z) with enough variability to help g conduct some initial explorations of the distribution in the whole outcome space. As is shown Section 4.1, the space searching tool p(Z) does not change the optimal value of g in expectation, so this is a reasonable initialization technique. Here, we pick p(Z) to be a uniform distribution ranging below the smallest and above the largest observed outcome U(min(y) − K, max(y) + K), such that z ∼ p(Z). Large and positive K enforces the initial exploration of g on a larger space. It could also be chosen as the marginal empirical distribution of outcomes, as long as enough variability is involved. This pre-training step adopts the following loss, We will show in Section 5 that learning g with this pre-training procedure already makes a competitive uncertainty method. For all the experimental datasets we included, the pretraining process plateaus within 1,000 training iterations, which we interpret as a sign of quick stabilization. The pre-training is further discussed in Appendix D. We usually prolong the pre-training iterations to be between 10,000 to 20,000 to fully stabilize g by searching the outcome spaces more extensively. This pre-training process assists f and g in reinforcing each other during the joint learning process.
After g is pre-trained, we use the property described in Eq.
(1) to robustify and refine the training under the full collaborating networks framework by jointly learning f and g. Specifically, for ideal functions the mapping g(f(u,X),X) = u reduces the g-loss to −[q log(q) + (1 − q)log(1 − q)] for a given q. Thus, a requirement for a well-trained network is that the output from the g function actually matches the chosen p(q) distribution, which it should do in an optimally trained model. We enforce this condition by constraining the first and second moment of the logits in the g-network.
g-loss γ : Here, σ −1 (·) is the inverse sigmoid function (or logit function), σ −1 (q) = log(q/(1 − q)), and maps from a probability to logits. Since we define q ∼ p(q) as the uniform distribution in practice, these moments are straightforward to calculate as E q Unif(0, 1) σ −1 (q) = 0 and E q Unif(0, 1) σ −1 (q) 2 ≈ 3.29. While at first glance, enforcing these constraints seems like it may require a detailed optimization algorithm, it can be accomplished approximately by using batch normalization functions (Ioffe and Szegedy, 2015). Instead of the typical batch normalization function, it is implemented with the learned affine transformation on the output replaced by a predefined scale and shift to approximately match these idealized first and second moments. This technique forces the predicted coverage to roughly match the implied optimal distribution over q, stabilizing learning and providing additional information to the model to reduce overfitting. This regularization is another merit of learning the g and f jointly. Another indication of this regularization is that the g-loss will plateau around 0.5 after f appropriately learns its inverse, since E q Unif(0, 1) [ − qlog(q) − (1 − q)log(1 − q)] = 0.5.
This phenomenon in the training loss is observed in each of our experimental datasets, and more details can be found in Appendix D. In practice, we prefer to have q ∼ Unif(0, 1), as a well learned inverse CDF with uniform quantile generator emulates the data generating process of the true distributions which discriminate high versus low density regions. Because of this, we expect that a learned f could be weak in tails because they cover a wide region with low densities compared to the high-density middle region. This phenomenon is observed in the experimental results of Section 5. This limitation could be addressed by choosing a heavy-tailed q distribution (e.g., q ∼ Beta(.5, .5)); however, since we primarily evaluate on the g function this is not a limitation in practice.

Algorithm 1
Full Learning Algorithm [Input]: Training samples {x i , y i }, for i = 1, ···, T. A random generator q ∼ p(q) for percentile and a random generator p(Z) to generate value z for space searching during the pre-training.
[Optional] Reduce the raw covariates of X.
Pre-training to initialize g: Sample a mini-batch from training samples and generate z i ∼ p(Z) for space searching.
end for Joint learning f and g: for iter = 1, ···, N iter do Sample a mini-batch of training samples and generate percentiles q i ∼ p(q) for each sample.

end for
With these aforementioned techniques, the learning algorithm has been highly stable and robust in our empirical evaluations. This full procedure is shown in Algorithm 1. In our setup, we define g as a multilayer perception (MLP) with 2 hidden layers and f with 3 hidden layers. The full model specifications are given in Appendix C. The computation cost is comparable to training two regular MLPs. When we trained the networks with a single NVIDIA P100 GPU, the pre-training process ran at 483 it/s, and updates of g and f in the joint learning process ran at 152 it/s with batch size of 128 and an input feature space size of less than 50. Empirically, 20,000 pre-training iterations and 20,000 joint training iterations are sufficient to yield good uncertainty estimates for dataset with fewer than 10,000 samples.

Results
In the following sections we report on our empirical simulations. We explore the impact of learning the f network in Section 5.2. We then evaluate our proposed method on synthetic data to explore its performance compared to competing methods and the optimal ground truth functions in Section 5.3. Then we compare methods on multiple real-world datasets in Section 5.4, showing improved performance across a variety of tasks. We include three variants of CN in this experimental section to evaluate CN's theoretical properties empirically. The first two variants come from the joint learning framework over g and f, and we denote the distributions estimated by g as "CN-g" and f as "CN-f." The third variant is learn g with a fixed f, which is denoted as "g-only."

Metrics
We base our uncertainty estimation evaluation on four main criteria, which are described mathematically below. First is calibration, which measures how well the predicted coverage of certain interval matches with the actual coverage. Second is sharpness, which evaluates the width of the interval. For example, if two methods both have calibrated intervals, but one method has a much smaller range of uncertainty, it is preferred. Third, we evaluate how well we capture the full conditional distribution by evaluating a discretized approximation to the conditional log-likelihood. Fourth, we evaluate prediction of the median of the data by evaluating Mean Absolute Error (MAE). We note that it is possible to evaluate Mean Squared Error (MSE) as well, but this requires averaging over the full conditional CDF. Median estimates, as evaluated by MAE, are more natural to evaluate in these methods. (Kuleshov et al., 2018;Gneiting et al., 2007). A predicted nominal quantile is well calibrated when E p(Y | X) y q 1 , X < Y < y q 2 , X = q 2 − q 1 ,
In practice, we could also construct intervals with different widths on which the calibration property should still hold. Hence, we introduce a more generic notation for intervals with q (e.g. 95 %) nominal level as I q,x for Y |X = x. The miscalibration at q can be quantified as the absolute difference between q and the true probability of Y |X = x falling in this interval: |u−P Y |X=x (Y ∈ 1 u,x )|. In practice, only one or a few samples given X = x can be observed, hence the miscalibration is aggregated over the full data, and has become a marginal concept. The miscalibration at q can be defined as follows: This quantity cal q can be evaluated and averaged over a sequence of q j ∈ {q 1 , …q M } between (0, 1) to evaluate calibration on the full spread of outcome distributions. The importance of each nominal level q can also vary by assigning different weights w q , which creates a metric: Note that several definitions of the interval could be used. For the purpose of our study, we pick two-sided equal tail interval [y q/2,X , y 1−q/2,X ] as our uncertainty objective. In empirical evaluation for cal, we picked equally spaced 8 percentiles, with q 1 = 0.02, q 8 = 0.98 and all weights w q j set as 1. In some scenarios, 90% intervals are of special interest for decision making, so we also report the empirical coverage for intervals at the nominal 90% level, Note that the metrics cal and 90% do not discriminate between a marginally calibrated method and a conditionally calibrated method as both can perform well on these metrics. Thus, we require additional metrics to fully evaluate the methods.

Sharpness-At first glance
, evaluating sharpness appears straightforward because a sharper method should produce narrower interval given any proposed nominal level q. However, simply reporting the interval width under any nominal level q does not form a fair standard for comparing sharpness, as it could reward some less calibrated methods which achieve sharpness by sacrificing calibration and being overconfident in interval estimate. While some prior works do report sharpness as a function of the nominal interval and evaluate the sharpness versus calibration tradeoff (Gneiting et al., 2007;Thiagarajan et al., 2020), we instead focus on making a visual approach to characterize the trade-offs between empirical coverage generated by each method and the predicted average width.
Explicitly, given a method, for any proposed nominal level q j , we first generate the uncertainly interval for each validation sample i with lower and upper bound I q j , i = l q j , i , u q j , i . The median interval width under q j can be represented as widtℎ q j = med i = 1 N u q j − i , l q j , i . Then we calculate the true frequency also known as the empirical coverage value of outcomes truly covered by these intervals: N I y i ∈ I q j , i /N. We repeat this procedure for different q j and generate a mapping between q j , widtℎ q j . By plotting q j against widtℎ q j , we produce a curve characterizing the empirical coverage and interval sharpness. Sharp methods will correspond to a lower curve in this visualization result.
This will allow the reader to understand the sharpness with regards to the actual coverage, as this evaluation does not reward methods that achieve "sharpness" by producing overly confident intervals.

Predictive Log-Likelihood Approximation:
Goodness of Fit-In order to assess how well the predicted conditional distribution actually fits the data, a standard statistical approach is to evaluate the log-likelihood. Because the algorithms we are comparing more naturally produce intervals rather than probability density functions, this is challenging to do directly. Instead, we will use a "goodness-of-fit" (gof) metric that approximates the log-likelihood by using a discretization of the interval. Specifically, we discretize the real line into mutually exclusive bins B 1 ,B 2 , · · · ,B m , and ∪ i B i = ℝ. The discretization approximation to our log-likelihood is then given by In all our experiments, we picked 10 bins, with B 1 ∈ (−∞, a 1 ], and B 10 ∈ (a 9 ,∞), with a 1 and a 9 denoting the fifth and ninety-fifth percentile of the empirical distribution on Y. The intermediate bins were chosen to be equally spaced between those intervals.
Note that the log-likelihood values here will be maximized in expectation when the estimated distribution exactly matches the conditional distribution, which could also be viewed as comprehensively assessing the calibration and sharpness. Note also that the values of our gof metric are dependent on the locations of a 1 , …, a 9 and the number of bins.
However, these are held constant for all algorithms for a fair comparison.

Mean Absolute Error-
The mean absolute error is minimized in theory if the point estimate captures the true median of the outcome. For each observation i, we estimate its conditional median y .5, X = x i , and define the MAE as follows:

Impact of Learning the f-Network
Since f only needs to satisfy mild conditions for g to be able to learn effectively in the asymptotic limit, it might seem unnecessary to learn f at all.
We set up a synthetic example to evaluate the impact of learning f in the finite sample case.
To facilitate visualization, we draw N = 100 equally spaced points between [−0.5, 0.5], {x 1 , · · · x 100 }, to form a covariate space with a single dimension. For each subject i with covariate . The range of the covariate, [−0.5, 0.5], covers two full periods of the trigonometric mean function, sin (4πx i ), and the heteroskedastic error assigns larger outcome uncertainty to subjects with larger mean values. Appendix C gives the full training details.
We adopt an overparameterized neural network architecture, which creates an interpolating setting in a typically trained neural network (Belkin et al., 2019). We first demonstrate that overfitting occurs for the point predictions from the mean squared error (MSE) loss, the conditional median (QR_0.5), the conditional 25'th quantile (QR_0.25), and the conditional 75'th quantile (QR_0.75) estimated by the quantile regression. Quantile regression is designed to approximate any conditional quantile information for a given q ∈ (0, 1). These methods all collapse to the observed outcomes and present little spread as shown in Figure  2(a).
Next, we evaluate the learned distribution for different variants of CN by comparing their estimated medians and Inter-Quartile Range (IQR) versus the ground truth. First, we set up the g-only approach with f fixed as the uniform distribution, U(−2.5, 3.5), which we denote as U-g. We do not enforce the moment matching as in Section 4.2. U-g does not collapse (Figure 2(b)), but its median and interval estimates are poor. Next, we set f to the ground truth conditional CDF. This setting is infeasible in practice, but will evaluate performance with the theoretically optimal f. We denote this case as T-g, and it gives good agreement to the ground truth, as shown in Figure 2(c). Finally, we use the full framework of CN to learn the outcome distributions with the results of learned g (CN-g) and f (CN-f) functions presented in Figure 2(d) and 2(e) respectively. CN-g and CN-f both have good median estimates, but CN-g's interval estimates are sharper and closer to the true values. While the performance of full CN is not quite as good as T-g in constructing sharp intervals, it is drastically better than using a suboptimal f function and more precisely predicts the spread of the true distribution.
From this setup, we see that while using a suboptimal f does not effect g in the asymptotic limit, it does in the finite sample case. Since setting f to the ground truth distribution is not realistic, it is important to adopt the collaborating structure. To address how this impacts the function's convergence to the true distribution, we set up a second study to explore how these different modeling approaches match the theoretically optimal distribution (true distribution) as N increases in this generation procedure using the gof metric. Training details can be found in Appendix C. The performances are visualized in Figure 3.
When f is the true inverse CDF, T-g stabilizes at small values of N with excellent performance. The joint learning scheme (CN-g) has only a small performance margin to when we known the inverse CDF. On the other hand, U-g and CN-f's curve of gof progress much slower, requiring many more samples for the same performance. In practice when the inverse CDF is unknown, the collaborating learning scheme should be preferred over a fixed f since it leads to faster convergence as a function of data samples. Note, though, that all methods converge to the ground truth distribution asymptotically, which is consistent with our theory.

Comparisons on Synthetic Data
We propose two synthetic cases to evaluate how well the Collaborating Networks recover the ground truth conditional distribution, which we denote as the theoretically optimal uncertainty estimate (TH). The first synthetic case uses a heteroskedastic Gaussian distribution, which offers scale and location transformations. The second case is simulated under the Weibull distribution that varies through scale and shape transformations. We include the three variants of CN: CN-g, CN-f, and g-only (with fixed f) to further assess their empirical performance. Moreover, six extra methods that are capable of estimating outcome uncertainties are also encompassed for full comparison. The first method is MC Dropout (DP), an approximate Bayesian inference for Gaussian process assuming homoskedastic outcome error (Gal and Ghahramani, 2016). The second is Concrete Dropout (CDP) which uses the same dropout scheme while combining heteroskedastic error (Gal et al., 2017). Then we have the exact Gaussian process regression (GPR) and parametric Gaussian process regression (PPGPR) (Jankowiak et al., 2020). The latter accounts for homoskedasticity and emulates the outcome distribution with variational methods. The fifth method is the Deep Ensemble Model (EN), which is constructed by fitting and combining five heteroskedastic regressions (Lakshminarayanan et al., 2017). The last method is the Conformilized Quantile Regression CQR, a two-step method that first estimates intervals through quantile regression and then adjusts the interval with residual errors for finite sample calibration (Romano et al., 2019). CQR is not scalable for full distribution estimation as it estimates one nominal level q at a time. Hence, CQR is skipped for gof evaluation.
In the two synthetic cases, each method is run for 10 replications. The means and standard deviations of their evaluation metrics over these replications are reported. In these synthetic experiments, the MAE evaluation is compared to the ground truth medians which are known to us. This is to help us better gauge a method's closeness to the ground truth value. The detailed model specifications and tuning strategies for all methods are described in Appendix C.

Scale and Location Shift
With a Gaussian Distribution-The first synthetic data follows a Gaussian distribution with a unique mean and variance value for each sample, y i N μ i , σ i 2 . Specifically, μ i N(0, 4), σ i ∼ Unif(0.5, 2, 5), and the covariate space x i = [μ i , σ i ]. We generate 1,000 samples in total and each replication is based on randomly splitting the dataset in a 7:3 ratio for training and testing. This heteroskedastic Gaussian example represents an ideal setup for some of the competing methods: CPD, PPGPR and EN. DP and GPR both assume the correct family (Gaussian), but slightly misspecify the uncertainty model by assuming homoskdacticity rather than heteroskedasticity. On the other hand, CN and CQR do not require any information on either the distribution or uncertainty a priori, and they learn such information through their distribution approximations.
The empirical metrics are given in Table 1. All methods provide good marginal calibration results by having (cal) less than 5% and nearly 90 % actual coverage for the nominal 90 % intervals. Based on these results alone, all methods are competitive. However, since both cal and 90 % are marginal quantities, they do not penalize a method for failing to capture the heteroskedasticity. However, the gof and MAE metrics differ between the methods, revealing differences in how well they capture the distribution. In the gof evaluation, CN-g, EN and CDP all essentially match the theoretically optimal value (TH). We assume that the heteroskedastic Gaussian assumption in EN and CDP give them the edge in performance.
Although PPGPR assumes heteroskedasticity, its variational inference strategy might cause it to lose some extent of precision in discerning nuanced details of the ground truth distributions. In MAE estimation, CN-g and CN-f prevail by more accurately capturing the true medians. It is also suggested by our joint learning scheme that CN-f helps refine CN-g more in the middle spread of the outcome spaces. CQR has a less competitive MAE.
Despite having an unbiasedness property in large samples, this is not guaranteed for a finite sample approximation.
Moreover, we could assess how each method responds to the heteroskedastic variance. Under the Gaussian distribution, the optimal 90 % interval scales with σ i (Lehmann and Romano, 2006). To visualize the results, we estimate the 90 % intervals for all evaluation samples and plot their interval widths against the optimal interval width. Having all points distributed near the 45 degree diagonal line is an indication of accurately capturing the heteroskedasticity. The result is summarized in Figure 4. Here, CN-g and CN-f visually have the best agreement with TH, as their resulting widths scatter narrowly and only deviate in the extreme widths. g-only, CDP, PPGPR and EN all capture the heteroskedasticity but present larger variations. CQR learns the basic trend of how uncertainty varies, but with a lot more estimation variability. DP and GPR have intervals that vary only slightly because they assume a fixed variance on top of functional uncertainty.
We further evaluate whether the estimated conditional CDF reproduces the ground truth by sketching the estimated distributions in Figure 5. Note that CN-g and g-only almost perfectly mimic the ground truth, and this holds up across a variety of input features. Except for the tail regions, CN-f provides a close match to TH as suggested by the property of our joint learning scheme. Compared with the three variants of CN, other methods less accurately describe the individual distributions.

Scale and Shape Transformation with a Weibull Distribution-
The second synthetic example is based on the Weibull distribution. The Weibull distribution's support is on non-negative values and is frequently used in survival analysis to model the relationship between failure and time (Collett, 2015). The Weibull distribution has two parameters that define its scale (λ) and shape (k). Each sample y i is generated from a Weibull distribution with a unique scale λ i ∼ Unif(0.5, 2) and an unique shape k i ∼ Unif(1, 5). The input covariates to the method are x i = [λ i , k i ]. We generate 1,000 samples and each split of training and testing sample follow a 7:3 ratio.
Note that the Gaussian distribution can provide approximations to the Weibull distribution (Kulkarni and Powar, 2011), but they are ultimately from two different distributional families. Therefore, all Gaussian models are subject to misspecification. Table 2 summarizes the metrics on these results. CN-g dominates three out of four metrics and is comparable to the ground truth (TH). EN and CDP are no longer as competitive as CN-g in this case. Due to their flexibile heteroskedastic Gaussian structures, they are still able to give reasonable uncertainty estimates, and they both outperform the other homoskedastic Gaussian methods.
As the Weibull distribution is not a symmetric distribution like the Gaussian, all Gaussian based approaches fall short in estimating the true median. CQR still calibrates the marginal uncertainty well, but does not accurately estimate the median values.
Considering the Weibull distribution's utility in survival analysis, we additionally estimate the survival probability to compare how well each method captures the scale and shape information. Given a method, we estimate the survival probability beyond 1 ℙ Y i > 1 | X i = x i for all evaluation samples and plot their estimates against the ground truth values. From Figure 6, we observe that CN-g has the best agreement with TH; CN-f and g-only also estimate the survival probability well, but have more variability. CDP, PPGPR and EN perform better than DP and GPR on average in capturing the survival probabilities, but are still restricted due to their model misspecification.
Finally, we use each method to estimate individualized CDF and plot them against the ground truth for some random drawn examples. The result is shown in Figure 7. Here, CN-g and g-only outperform the other methods, and CN-g gives closer distribution approximation in the second example than g-only. CN-f provides good estimates in the middle but not on the tails, likely due to the Weibull distribution's heavier tail. The three variants of CN assign nearly zero probabilities to the outcome region near 0, which exemplifies their good understanding of Weibull outcomes after proper learning. On the other hand, all Gaussian methods give relatively large probabilities to the outcome region near 0 especially in the second example in Figure 7(d). Their struggles are attributed to approaching an asymmetric family of distribution with their symmetric forms.

Comparisons on Real-world Datasets
In this section, we evaluate our methods on six real-world dataset examples. In addition to the comparisons on CN-g, CN-f, g-only, DP, CDP, GPR, PPGPR, and CQR used in the synthetic experiments, we include calibrated regression (CR) that provides post-hoc recalibration to a trained model (Kuleshov et al., 2018). As DP is not always guaranteed to be well calibrated (Gal and Ghahramani, 2016), we use CR as a second calibration step for DP (DP-CR). The introduction of CR is also helpful in demonstrating that our sharpness evaluation is invariant despite a model's calibration.  , 2018). To predict the outcome distributions, we followed two strategies. First, we fit an LSTM model for mean prediction and then use the latent representation extracted from LSTM as input features for all competing methods. For CN, we purposed a second strategy to define the g and f networks as LSTMs: CN-g (LSTM), CN-f (LSTM), g-only (LSTM).
These full networks are optimized collectively.
We demonstrate with all six datasets how our method can scale and adapt to complex data structures and be flexibly combined with many network architectures. Training and evaluation follows a 0.6/0.4 split. Each method is run 10 times and we report their standard deviations on the evaluation metrics in the first four datasets. In the last two datasets, each method is evaluated on a single split due to the higher cost of computation. The detailed preprocessing process can be found in Appendix C, which includes removing missing values and encoding categorical variables. The following Table 3 summarizes the sample size and input feature of each preprocessed dataset. Table 4 gives the calibration results and the nominal 90 % interval coverage. CN-g and g-only outperforms other competing methods in calibrating the nominal interval coverage in 5 out of 6 cases. The CN-f is capable of generating intervals that have fair calibration but is not as strong. As Airline and EHR are two large datasets, CN's large sample property is highlighted and we witness improved calibration for CN-g,CN-f and g-only. CQR also has consistently good calibration results across the different cases. All heteroskedastic Gaussian approaches on average generate uncertainties that calibrate outcomes better than their homoskedastic counterparts. DP's performance varies, and it fails to calibrate well enough nearly in all cases. However, after adjusting DP with CR (DP-CR), the overall calibration is effectively rectified.
In addition to cal, we also visualize how well a method calibrates each of the proposed nominal levels. Figure 8 summarizes the results for Airline and EHR datasets. In these two datasets, we observe that the three variants of CN barely differ from the true nominal levels, which provides further evidence on CN's consistency and stability in calibrating all nominal levels simultaneously. It is also shown that CN-f calibrates well at lower nominal levels but is weaker for larger nominal levels, consistent with the finding that CN-f is weaker at estimating tails.
Then we turn to two model fitting metrics, as MAE reflects the accuracy for the median estimates and gof reflects the accuracy for the distribution estimates, shown in Table 5. The CN methods dominate in almost all datasets, with CN-g typically performing the best. G-only marginally outperforms CN-g in small datasets with less than 500 samples (CPU and MPG) in gof, and closely follows CN-g in large datasets. G-only's edge in small datasets and in its combined version with LSTM could be due to its single loss structure, which eases its parameter optimization. CN-f still excels in capturing the middle spread of the outcome (MAE), but is less so on gof because it struggles with the tails of the distribution. As a Gaussian distribution provides a close approximation to many families of distributions and also because of the central limit theorem, Gaussian based approaches provide good uncertainty estimates in many cases such as in CPU and MPG, which is reflected by their gof. Nonetheless, in the cases where the outcome distribution is non-symmetric and less Gaussian-like, CN's advantage becomes more salient as it approximates a wider spectrum of outcome distributions. For example, in the crime data, the outcome is the number of violent crimes per 100K population, and the Poisson distribution is usually the appropriate choice for this type of outcomes (Short et al., 2008). Since most of the regions have very low crime rate 4 , a larger sample size might be required to approximate the distribution as a Gaussian-like. CR efficiently re-calibrate DP via a two-step procedure, and has also rectified its understanding of the outcome uncertainties by augmenting the model fitting (gof). Figure 9 summarizes the interval sharpness information by plotting the true interval coverage against the median interval width for each method. A lower curve indicates that a method generates sharper intervals. In most tested tasks, CN is either the sharpest or equally sharp to competing methods. In addition to the evaluation on numeric metrics, the sharpness plots also supports CN-g's advantage in more accurately capturing the conditional distributions as it achieves both calibration and sharpness instead of trading one for another. We also learn from this sharpness plot that CR's recalibation on DP does not enforce sharper intervals. CR improves calibration by adjusting nominal level to match up with the empirical level. However, it does not extract extra information from the data.
In the EHR dataset, as patients did not visit hospitals in a regular pattern, their visiting times are heterogeneous, which enables us to further assess how each method responds to this heterogeneity. We use the 90% interval width to study heterogeneity in visiting times, since each method is able to reach approximately 90% true coverage given 90% nominal levels.
For each of the following criteria on last observation time(t): t ≤ 8; t > 8, we randomly select 2,000 observations and compare the estimated interval widths among different methods. The result is encapsulated in Figure 10 with boxplots. First, we notice that as times increases, the interval widths generally get larger with more spread, which indicates these models' agreement on giving larger variability and uncertainty to the data points with increased time. Under each time, the position of CN-g's IQR is lower than the others. It also reflects CN-g's sharpness as it reaches approximately the same true coverage but with mostly narrower intervals. Overall, CN's joint learning framework shows its capability in drawing reliable uncertainty estimates for large-scale complex data, and producing significantly sharper intervals. These uncertainty estimates can be used to derive future values for patients, and our empirical results suggest that the uncertainty intervals are highly trustworthy.

Discussion and Conclusion
In this paper we propose a collaborative learning scheme by simultaneously training two neural networks that characterize the CDF and inverse CDF of the conditional distribution P(Y |X). The computational cost of training CN is approximately the same as training two MLPs with fast convergence. In all of our empirical evaluations, CN's performance was stable with varying architectures. Even with overparameterization we did not see a significant change in the interval quality, whereas we observed overfitting in this scenario in several competing methods. In analyses of real-world datasets and synthetic data, our method showed its capability in drawing reliable uncertainty estimates from small to largescale data with both non-temporal and temporal data structures. Empirically, our proposed method gives more accurate estimates of coverage and improved sharpness compared to the competing approaches. The method is supported by our theoretical analysis, and appears to be robust in practice.
In Supplemental Section E, we show that CN has the potential to be extended to multioutput problems. The computation cost of our proposed extension increases linearly with respect to the size of output. Moving forward, we will also consider extensions to causal inference to model the heterogeneous treatment effect for each individual and focus on interpretable modeling. grant (UL1TR001117), and by Duke University Health System. The CTSA initiative is led by the National Center for Advancing Translational Sciences (NCATS) at the National Institutes of Health.
The contents of this manuscript are solely the responsibility of the authors and do not necessarily represent the official views of any of the funding agencies or sponsors.

Appendix Appendix A. M estimator for the consistency
We start by restating Theorem 1 in Section 4.1 for an M-Estimator due to Van der Vaart (2000):

Theorem 1
For ϵ > 0, if the following three conditions are satisfied, then d(δ 0 , δ 0 ) p 0,, n → ∞ 1. In this theorem, we define Δ as a function space. δ 0 is the ground truth. M n represents the sample average of objective as a function of δ, and M represents the expectation as a function of δ. The whole theorem to prove the consistency of estimator δ n is based on a maximization framework.

A.1 Marginal Consistency
We first prove the consistency of our estimating function in marginal form without X ⊂ R d . Throughout the rest of this document we use K as a positive constant that may have different values in different situations.

A.1.1 Consistency of g-network
Defining a Alternative Form of the Objective Function.: From the g-loss, we have where f(q) can be replaced by any C ∼ G(c). G(c) is a pre-defined distribution, which does not influence the consistency as long as it has support over (0, 1).
We assume that the distance measurement is the cumulative probability density of C, i.e. μ(c) = ℙ(C ≤ c). Based upon that, the distance between two functions can be defined as Then our g-loss can be re-written as: To use Theorem 1 to prove the function consistency, we need to transform the original minimization problem of our g-loss into a maximization problem. The new objective function is: With the above objective, minimizing the g-loss is equivalent to maximizing the the objective function with respect to M(g). Further, define where g is the general form for a CDF function and therefore should be monotonically non-decreasing.
Defining a New Parameter Space.: In reality, it is impossible to learn the full span of the distribution with limited training samples (i.e., we get very few samples on the tails). For inference, we are only interested in exploring a partial range, [ϕ 1 , 1 − ϕ 2 ] (i.e., [0.025, 0.975]). Assuming that there exist two real numbers low and high, such that for positive small numbers ϕ 1 and ϕ 2 , g 0 (low) = ℙ(Y ≤ low) = ϕ 1 and 1 − g 0 (high ) = ℙ(Y > high ) = 1 − ϕ 2 , where g 0 is the ground truth CDF. Thus, we bound the domain of the exploring space to be within [low, high], which can always be adjusted to cover the inference of interest.
Based upon the above claim, δ (function in Theorem 1) can be further extended as any functions within the following family.
ℱ: Monotonically non-decreasing [low, ℎigℎ] ϕ 1 * , 1 − ϕ 2 * , where 0 < ϕ 1 * < ϕ 1 , 0 < ϕ 2 * < ϕ 2 (that is, Δ = ℱ for Δ in Theorem 1). It is always possible to find the range, since we can choose small number or scale the variables in practice. Under this setup, since we are only interested in learning a fraction of the distribution function within a bounded domain, it is possible to define a well-behaved function to generate the output C. Specifically, we define C ∼ Unif [low, high], where each point is placed with equal importance, then according to Van Der Vaart and Wellner (1996), for every r ≥ 1(in our case r = 1 for the L 1 norm), where we choose P = μ, the probability measure for C. Note that N [ ] is the notation for bracketing number, which stands for the complexity of the family of functions (Van Der Vaart and Wellner, 1996). It is trivial to show that ℱ ⊂ ℱ′ when we restrict the domains of all functions in ℱ′ to be [low, high]. Then we have the bracketing number for our function space logN [] ϵ, ℱ, L r (P ) ≤ K(r)(1/ϵ) .
Based upon the above inequality, the new family of functions is defined as with ℳ = m g , g ∈ ℱ . (16) We can easily construct a bracket for this new function by bracketing m g with Then by mean value theorem |log x| ≤ K|x − 1|, for any x with a positive lower bound, the probability measurement for any space (Δ,C) and measurement P δ,c , there is P δ, c m i u − m i l = P δ, c m i u − m i l Hence, our newly defined ℳ in Eq. (16) is from the Glivenko-Cantelli class. It ensures the uniform convergence of the functions in ℳ, and therefore condition 1 in Theorem 1 follows naturally (Van der Vaart, 2000).
The second condition focuses on proving that the expectation of the loss function has enough separation between two different functions. Let g 0 denote the ground truth gfunction, then the difference of evaluated objective function M under two function g and g 0 is M g 0 − M(g) = E c g 0 (C)log g 0 (C) g(C) + 1 − g 0 (C) log 1 − g 0 (C) 1 − g(C) = E c g(C) g 0 (C) g(C) log g 0 (C) g(C) − g 0 (C) g(C) + 1 For function with form h(x) = x log x − x + 1, it can be shown that for a large Z > 0, there exists C Z , s.t. h(x) > C Z (x − 1) 2 , 0 < x < Z. Then where d() is defined in Eq. (14). Condition 2 follows with the above inequality.
Condition 3 is obvious if we can get the maximum likelihood estimator (MLE) g n with the full class defined by class ℱ in Eq. (15). In practice, we try to attain the MLE within a function class defined by neural networks, and we need to show that the maximization by neural networks still makes the condition hold. Note that based on the proof of Theorem 1 (Van der Vaart, 2000), ℱ includes both g 0 and g n for large n, i.e., g n is monotone. Hence, to guarantee this monotonicity, we further assume this class attained by neural network is a subclass of ℱ and defined by Now, we switch our focus to the existence of the function g n . To simplify the argument, assume g 0 ′ is the true density function (the derivative of the CDF g 0 ) which has a positive lower bound within [low, high]. From the Universal Approximation Theorem, there exists a function g n ′ (x) = ∑ i v i ϕ w i x + b attained by a neural network, s.t. g n ′ (x) − g 0 ′ (x) ∞ < o(1).
Then for large n, g n ′ also has a positive lower bound.
Next we define g n = ∫ g n ′ dx. From this definition, it is easy to see that g n has form g n (x)=∑ i v i ψ w i x + b , with ψ = ∫ ϕ, that is, g n (x) is also attained by neural network. Besides, we have ∥g n (x) − g 0 (x)∥ ∞ < o(1). In addition, by the fact that for any large n, g n ′ also has a positive lower bound, we know that g n is also a monotone function for any large n. This proves that g n ∈ ℱ *.
In the following, we will show M n (g n ) > M n (g 0 )−op(1) to finish the verification of condition 3 and complete the proof. Since M n g n − M n g 0 ≤ 1 by the fact that |log x| ≤ K|x − 1| for x with positive lower bound and using the boundedness of g 0 and the fact that ∥g n (x) − g 0 (x)∥ ∞ < o(1).

A.1.2 Consistency of f-
The form of f-loss determines its strong reliance on the g function. This dependency shows up in both the consistency and the fixed point solution. Although we can use the g function to directly solve the quantiles, we are still interested in showing that f can reach consistency under some additional assumptions.
We start by showing a stronger version of convergence of g. Previously, we have shown that g is consistent in probability, now assume that we have: sup c ∈ [low, ℎigℎ] g n (c) − g 0 (c) p 0 (I) When g n is attained, f n can be defined as g n −1 , which minimizes the empirical f-loss as q=g n g n −1 (q) . The question is whether the sequence of f n converges to f 0 as expected.
If not, we have g 0 (y q -ϵ) ≥ qδ then |y q -ϵ y q | < ϵ, which is a contradiction! The same statement is true for the second part. With (I), we also have N 1 , s.t. when n > N 1 , we have g n (c) − g 0 (c) < δ. Now, let y n, q = g n −1 (q). We want to show that y q − y n, q < ϵ.

A.2 Extension to conditional distribution
In general, we are estimating the distribution of Y |X to prove the consistency of g. We still rely on the construction of M-estimator, but limit it to a multivariate function family. The major difference for the general case is using the smoothness of the conditional distribution to establish a multivariate function family is a Glivenko-Cantelli while for the marginal case we have shown that a univariate monotone function class is a Glivenko-Cantelli. The detailed proof for the general case with the conditional distribution is otherwise almost the same with the previous proof and is not presented here. In the following, we only present the theorem (Van Der Vaart et al., 1996) for evaluating the bracketing number for a smooth multivariate function class and therefore verifying that the function class is a Glivenko-Cantelli.

Lemma 7
gof(p) = E y, x p(Y , X) [log(p(y | x)], is minimized when the estimated conditional density p(y | x) matches the true density p(y | x): p(y | x) = p(y | x).

Proof
For any estimated density of p(y | x) the difference of log likelihood evaluated by p(y|x) and p(y | x) can be expressed as It can be simplified as, < E x p(X) log( ∫ y p(y | x)/p(y | x) × p(y | x)dy) (Jensen'sInequality) = E x p(X) log( ∫ y p(y | x)dy) = 0 Therefore, the true density always minimizes the expectation.

C.1.1 Learning under a suboptimal setting
We adopt over-parameterized neural network structures and a small learning rate to achieve an environment where a method can easily overfit the data. The learning rate for all methods is fixed to be 1e-5 with ADAM optimizer (Kingma and Ba, 2014). We set the batch size to be equal to the sample size of 100 throughout all methods. For the four benchmarks that estimate points regarding the conditional means with the MSE loss and three conditional quantiles(QR_0.5,QR_0.25,QR_0.75) with the quantile regression losses, we designed the neural network structure as a feed-forward network with two hidden layers both of size 1,000, and it uses ReLU as the layer activation function (Nair and Hinton, 2010). For CN, the g function in this section is designed to be a feed-forward network with two hidden layers both of size 1,000 and the f function is designed to be a feed-forward network with three hidden layers all of size 1,000. The eLU activation function (Djork-Arné et al., 2016) is used for both both g and f in CN.

C.1.2 Network for convergence comparisons
The auxiliary function f is set to be the ground truth for T-g, and a uniform distribution for U-g. In this case, we stop using the over-parameterized network structures. We reduced the network size for computational efficiency as a feed-forward network with two hidden layers with size of 100 and 80; and the f function is designed as a feed-forward network with three PPGPR: The implementation of the parametric Gaussian process regression is also based on python package gpytorch. We used an RBF kernel. The heteroskedasticity in GPR is achieved by modifying the marginal likelihood to PredictiveLogLikelihood according to https://docs.gpytorch.ai/en/v1.1.1/marginal_log_likelihoods, which corresponds to the implementation of PPGPR (Jankowiak et al., 2020). For datasets with size larger than 500, we randomly chose 500 points as inducing points. The system was trained for 200 epochs.
Due to the intractability of storing the PPGPR object for large datasets, we used 1/10 of the airline and 1/3 of the EHR datasets to generate its results.

CQR:
The conformalized quantile regression is implemented according to https:// github.com/yromano/cqr. We use its built-in random forest model with default parameter tuning scheme and model specification. CQR can generate uncertainty interval for any specified nominal level q at a time, but this is not scalable for the sketching of the full distribution, so we skip gof evaluation for CQR. The conditional median estimate using CQR is achieved by first estimating the the uncertainty interval of which the left and right limit correspond to the conditional 49.5 % and 50.5 % quantiles. Then we take their average to obtain an approximation of the median.

EN:
The deep ensemble model is implemented by stacking five heteroskedastic regressions. Each of them were implemented by tensorflow probability, with negative log likelihood for Gaussian distribution as its loss. The hidden layer architecture is identical to g network in CN for a fair comparison. The optimizer is ADAM, and we use the inverse time decay with decay rate 1 and decay step every 100 epochs through an available Keras module 5 . We train with 300 epochs for each experiment. Grid search is employed to tune the a regularization term for variance estimates based on gof. We use the mathematical expression of the predictive variance in Section 2.4 of Lakshminarayanan et al. (2017) to calculate the heteroskedastic error of the outcome.

C.2 Dataset Preprocessing
Except for the EHR dataset, all datasets are normalized in both their input features and outcomes. All categorical variables are re-coded in one-hot forms. Covariates with over half of the entries missing are removed. In the Crime data, due to the strong correlation among its covariates, we extracted the first 15 principle components which account for 86 % of its total variability. For the energy data, the covariate of current time and the outcome of the previous observations are combined to predict the current outcome. For the Airline dataset, we pick the first 800,000 observations and remove observations with missing outcomes to constitute our experimental dataset. For electronic health records data, the dataset is first discretized to month intervals. Missing data are imputed using carry-last-one-forward or the population mean if no value is available. To help with informative missingness, a binary indicator variable is augmented for each feature to mark whether the value is real or imputed. After these preprocessing steps, an LSTM is applied to the data using 200 units to predict the mean of the outcome with squared loss, but only the observed data contributed to the calculation of empirical loss. We train it for 2,000 iterations with batch size 128 and default ADAM optimizer. Then we output the LSTM units at those observed time points as the new covariate space with an augmented variable added to represent the time since last visit for each current visit. Lastly, we combine all the observed time points from all patients, which forms a long datasets.

Appendix Appendix D. Visualizing Additional Results
In this section, we present some additional explorations and experimental results.
First, we visualize the loss of CN during the pre-training and post-training processes. We note that the loss of g-only is exactly the loss for the pre-training process in CN-g. g-only does not have the post-training and it uses the same loss throughout. We include three datasets to visualize their losses. They are the first synthetic example with the heteroskedastic Gaussian distribution, the Airline dataset, and the EHR dataset with an LSTM integrated into CN. The batch size is 128 in all three examples. In the EHR dataset, the batch size for the LSTM modified CN is the complete trajectory of 128 individuals. Each individual has multiple visits, so the 128 batch size include more data points in this temporal setting, which produces smaller variations in their evaluated losses. Figure S1 characterizes the loss functions in the first 1,000 iterations of each of CN's variants. All variants in all three cases plateau within 1,000 interactions. The g-loss stabilizes around 0.5, which matches the theoretical analysis. Figure S2 describes the losses of the full training process. Although each method stabilizes early, we introduce the extended training to help in exploring the outcome spaces more extensively and refining the learned distributions. It is observed that all variants of CN stay stable for extended training iterations.
The order of the outcomes in the decomposition could be rearranged but that does not impact the equivalence between the left and right side. The training of the joint conditional distribution is characterized in Figure S5. For an output with m dimensions, we build m models sequentially. In the first model, the feature space is X, which is used to learn the distribution of Y 1 |X. For the second model, the feature space is the combination of Y 1 and X to enable us to learn the distribution of Y 2 |Y 1 ,X. As we move along the chain, we sequentially combine the feature space and the observed output of the last model to form a new feature space to train the present outcome. The model complexity increases in linear time regarding fitting m models.

Figure S5:
The sequential training scheme for the m dimensional output {Y 1 ,· · ·, Y m }|X.
To make inferences in this framework we will use a sampling technique. For a single random draw, we start by drawing a y 1 given x with the learned model Y 1 |X. Then we feed the observed y 1 and x to the model Y 2 |X, Y 1 as its input feature to draw a random y 2 and we continue this procedure until a full set of {y 1 , · · · , y m } is drawn.
We use two synthetic cases with two-dimensional and correlated outputs to illustrate this strategy. In the first case, we use a bi-variate Gaussian distribution. For each observation i, the outcome y i is synthesized, The input space is x i = [μ i,1 , μ i,2 , σ i,1 , σ i,2 ]. The outcome y i,1 and y i,2 both have heteroskedastic errors and their correlation ρ = 0.5. The parameters are simulated as: μ i, 1 N(0, 2), μ i, 2 N(0, 2), σ i,1 ∼ Unif(1, 2) and σ i,2 ∼ Unif(1, 2). We draw 2,000 training samples to learn this joint distribution. To demonstrate that the joint distribution is properly captured, we visualize the estimated joint CDF of a random x i and compare it to the true value. The joint CDF here is defined as F z 1 , z 2 | x i = ℙ y i, 1 < z 1 ∩ y i, 2 < z 1 | x i . Figure   S6 gives an example of the estimated joint CDF. We observe that the estimated value and the theoretical value are very close. Similar results are achieved across many random draws.

Figure S6:
Learning a bivariate Gaussian distribution. Given a random draw xi = [2.977, 1.352, 1.139, 1.312], 6(a) and 6(b) are the true and estimated CDFs. Their absolute difference in 6(c) demonstrates that the joint CDF is accurately learned.
For the second case, we demonstrate the learning of joint CDF with circular distribution whose parameters are defined on the polar coordinate system. Specifically, we used the von Mises distribution to synthesize the angle ψ for our circular distribution. The von Mises distribution is also known as circular normal distribution of which the PDF is defined as the following: f(ψ | μ, κ) = e κcos(ψ − μ) 2πI 0 (κ) .
The parameter μ and κ govern the location and shape of the distribution, and I 0 (κ) is the normalizing constant. The density of ψ is symmetric around μ and it characterizes the distribution of ψ in the support (0, 2π]. For each individual i, we set the input spaces to be two scale parameters for the von Mises distribution: x i = [κ 1,i , κ 2,i ], κ 1,i ∼ Unif(0.5, 2) and κ 2,i ∼ Unif(0.5,2). Then we generated two angles with the von Mises distribution given [κ 1,i , κ 2,i ]: ψ i,2 ∼ von Mises(0, κ 1,i ) and ψ 2,i ∼ von Mises(0, κ 2,i ). With fixed distance r = 1, these two angles correspond to the two points in the polar coordinates: p i,1 = (1, ψ i,1 ), p i,2 = (1, ψ i,2 ). Then we convert the two points into the Cartesian coordinate system: p i,1 = (cos(ψ i,1 ), sin(ψ i,1 )) and p i,2 = (cos(ψ i,2 ), sin(ψ i,1 )). Lastly, the two-dimensional outputs are created by taking the average of the first and second coordinates: y i = [y i,1 , y i,2 ], y i, 1 = cos ψ i, 1 + cos ψ i, 2 2 and y i, 2 = sin ψ i, 1 + sin ψ i, 2 2 . We generate 5,000 training samples for this case. The main purpose of this case is to show that CN can also learn the joint distribution complicated by the transformation between the two coordinate systems.

Figure S7:
Given the random drawn [κ1,i = 1.409, κ2,i = 1.600], 7(a) and 7(b) are scatter plots of 5,000 random samples simulated from the estimated and true distributions. We then estimate their joint CDFs respectively in the Cartesian coordinate system. The absolute difference in 6(c) demonstrates that the joint CDF is accurately learned.
We again visualize the estimated distribution in Figure S7. We observe a close match between the estimated value and the true value concerning the joint CDF despite the increased complexity.
There are still a few challenges that require future research to make this framework more practical. Although in theory the chain can be randomly ordered, it may impact the space searching efficiency when the dimension goes higher. Thus, additional research regarding the relationship between each output is necessary to create the most efficient order. In a one-dimensional setup, we have shown that using g over f improved accuracy. However, using g for multiple outputs may bring unwanted computational costs. As g models the CDF, we need to numerically invert it and use the inverse CDF instead. On the other hand, although f naturally forms a framework of drawing random samples, its weakness in the tails could also impact the quality of sampling given higher dimensions. Hence, either improving the sampling easiness of g or rectifying the tail property of f is a challenge worthy of further research.
As an alternative to this chain structure, the distribution can be modeled jointly by modifying the g-loss to expand all possibilities of comparing Y 1 , · · · , Y m to any random value z = {z 1 , · · · , z m } in the output space: 1 Y 1 < z 1 , …, Y m < z m , 1 Y 1 ≥ z 1 , ⋯, Y m < z m , ⋯, 1 Y 1 ≥ z 1 , ⋯, Y m ≥ z m . One caveat of this alternative is that the computational cost increases exponentially since each z elicits 2 m sets of comparisons. Moreover, the design of f in a one dimensional framework is not straightforwardly extendable for multiple outputs due to the difficulty of defining the higher dimensional inverse CDF. Searching the space with a uniform distribution will be subject to the curse of dimensionality given a large m. The strength of this tactic is its robustness to the ordering of outputs. Therefore, we are interested in conducting further experiments to assess the properties of this alternative in the future. Illustration of the CN framework. 1(a) describes training for a conditional quantile y(q, x) directly as the objective function to ensure calibration. However, the dashed arrow implies that the objective function does not produce a useful gradient. 1(b) gives the g-network, which helps with the non-differentiable objective function in Eq.
(2). In this framework, g and f are jointly trained to learn the CDF and conditional CDF, and they are connected by Eq.
(1). 1(c) gives the final mapping to generate the conditional quantiles after the network has been trained. Visualizations of uncertainty estimates in a 1-d synthetic dataset. 2(a) shows that MSE and Quantile Regression (QR) methods essentially fit all data exactly in these settings. 2(b), 2(c) visualize the median and uncertainty estimate of g given a fixed f function from uniform distribution (U-g), a fixed f function from the theoretically optimal distribution (T-g). 2(d) and 2(e) give the results of g (CN-g) and f (CN-f) functions learned under the complete collaborating networks scheme.   Scatter plot of the optimal 90 % interval widths against the estimated 90 % interval widths for all evaluation samples in the synthetic heteroskedastic Gaussian data. Points scattering closely on the 45 degree dashed diagonal line indicates a good agreement with ground truth. The full CN approach in 4(a) has overall the best agreement, whereas the competing methods do not capture the varying width as well. Note that DP in 4(c) and GPR in 4(d) both assume a fixed variance in each ensemble model, so intervals vary only slightly based on the samples in DP and the functional uncertainty in the GPR.  Visualization of the estimated CDFs compared to ground truth survival curves (TH) for two random samples of the Weibull synthetic data. CN-g and g-only closely mimic the true survival curves for both of the random samples, and CN-g slightly outperform g-only in the second random example in 7(b). CN-f struggles on the tails of the distribution. We believe that this is due to the fact that the network does not see that many tail samples and can tend to shrink towards the mean; regardless, CN-f is still a competitive method. Gaussian based approaches struggle as their approximations for the Weibull distribution (asymmetric) are based on a symmetric distribution (Gaussian).

Figure 8:
Visualization of the difference between the nominal and the empirical coverage for two real-world datasets: Airline and EHR. Curves lying closely on the 0 % horizontal line represents a good calibration result. In both cases, the three variants of CN consistently calibrate all nominal levels. Visualization of the interval sharpness for two real-world dataset: Crime and EHR. The true coverage for every nominal level of coverage is calculated for each method. These true coverages (x axis) are plotted against the median interval widths under that true coverage level (y axis). A curve in low position can be interpreted as given a level of true coverage, it generates comparatively narrower intervals, which gives sharpness. CN-g typically generates sharper intervals for both cases.  Metrics on the heteroskedastic Gaussian synthetic data. In this case, the proposed methods CN-g and g-only essentially match the theoretical optimal values, and match the performance of EN, which assumes the correct model form. We use boldface for the first two top-performing methods under each evaluation metric.